[Quantization] Support more than one quant-compressor #415
Conversation
🎉 LGTM!
Nice feature; agree with @kylesayrs's recommendation, plus updating docstrings and adding a test specifically for mixed-precision compression/decompression (see the sketch below).
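A hedged sketch of what such a test could look like. `build_mixed_precision_model` is a hypothetical fixture, and the `compress_model`/`decompress_model` calls, import path, and tolerances are assumed from the PR summary rather than verified against the library:

```python
import torch

from compressed_tensors.compressors import ModelCompressor


def test_mixed_precision_compress_decompress():
    # Hypothetical fixture: a model whose modules carry quantization schemes
    # with at least two different per-module compression formats.
    model = build_mixed_precision_model()

    compressor = ModelCompressor.from_pretrained_model(model)

    dense_weights = {
        name: module.weight.detach().clone()
        for name, module in model.named_modules()
        if hasattr(module, "weight")
    }

    # Per the PR summary, per-module (mixed) formats are handled by
    # compress_model/decompress_model; signatures are assumed here.
    compressor.compress_model(model)
    compressor.decompress_model(model)

    # Round-trip check: decompressed weights should be close to the originals
    # up to quantization error (tolerances here are illustrative only).
    for name, module in model.named_modules():
        if hasattr(module, "weight"):
            torch.testing.assert_close(
                module.weight, dense_weights[name], atol=0.1, rtol=0.1
            )
```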
Force-pushed from cd324dd to 8b5d4c9
Seems like there are 3 sources of truth for quantization format. It'd be nice if we had something like:

    def get_model_compression_format(model: torch.nn.Module) -> Set[CompressionFormat]:
        return set(
            getattr_chain(module, "quantization_scheme.format", CompressionFormat.dense)
            for module in model.modules()
        )
We still support the global compression format being overwritten, but this is not a common pathway, which is why it was not part of this PR's change for the per-module case. Ideally, we can also update our preset schemes to include the compression formats as well. But again, that is not what this PR is targeting, as it is not our typical user pathway. I agree we can remove it.
Has this been tested with model reloading? I see a couple potential issues there.

In the case where we want to load a model which has mixed compression:
- `from_pretrained_model` and `from_compression_config` both set `quantization_config.format` to be `"mixed"`. If `quantization_config.format` is set, `_fetch_unique_quantization_formats` will not be called.
- Since the model_compressor assumes that module formats have previously been set by `infer_per_module_quantization_format` and this function only, will this work for pathways in which we compress models without calling `infer_per_module_quantization_format` first?

There seems to be implicit coupling of `infer_per_module_quantization_format`, `ModelCompressor.from_pretrained_model`, and `ModelCompressor.compress`/`decompress`, where `infer_per_module_quantization_format` must be called before the others. If we're going to do this, we should raise errors if a module has `scheme.format = None`.
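A sketch of the suggested guard (not code from this PR) that would make the missing-format case fail loudly before compression:

```python
def validate_module_formats(model) -> None:
    # Raise if any quantized module reaches compression without a per-module
    # format, i.e. infer_per_module_quantization_format was never called.
    missing = [
        name
        for name, module in model.named_modules()
        if getattr(module, "quantization_scheme", None) is not None
        and module.quantization_scheme.format is None
    ]
    if missing:
        raise ValueError(
            "No compression format set on quantization_scheme for modules: "
            f"{missing}. Was infer_per_module_quantization_format called?"
        )
```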
Approving with the following list of follow-ups.

Follow-ups directly in scope of this PR:
- Consider inferring the compression format on a per-module basis. This enables users to manually specify formats (useful for debugging at the least) and, more importantly, decouples compression from requiring that `infer_quantization_format` be called prior:
    def get_module_format(module):
        qscheme = module.quantization_scheme
        sscheme = module.sparsity_scheme  # or from a map
        inferred_format = infer_compression_format(qscheme, sscheme)
        if qscheme.format is not None and qscheme.format != inferred_format:
            # warn that the explicitly requested format differs from the inferred one
            ...
        return inferred_format
We can still use a global override by passing the global override to this function
- Consider only inferring the `format` label at config serialization time, rather than prior. This avoids having to pass and parse the format in multiple places, and prevents user or model-loading code from accidentally passing "mixed" as a format:
    def update_config(self, model):
        config[QUANTIZATION_CONFIG_NAME].format = get_model_format(model)

    def get_model_format(model):
        return set(get_module_format(module) for module in model.modules())
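Since `get_model_format` returns a set, serialization would still need to collapse it to a single label; following the PR's convention, more than one unique format maps to the new mixed-precision value (sketch below, assuming the enum member is spelled `CompressionFormat.mixed_precision`):

```python
def get_config_format_label(model) -> "CompressionFormat":
    formats = get_model_format(model)  # set of per-module formats
    if len(formats) > 1:
        # more than one unique per-module format -> global "mixed-precision"
        return CompressionFormat.mixed_precision
    return next(iter(formats), CompressionFormat.dense)
```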
Follow-ups that are related but might make implementation easier:
- Consider refactoring compressors into functions, not objects:

    def compress_model(model):
        for name, module in model.named_modules():
            format = get_compression_format(module)
            module = compress_module(module, format)
            set_module(model, name, module)

    def compress_module(module, format):
        if format == CompressionFormat.dense:
            return module
        if format == CompressionFormat.Sparse24:
            return Sparse24Compressor.compress_module(module)
        ...
- Consider refactoring `format` to not be nullable. This reduces the required parsing logic and tightens type hinting.
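A minimal illustration of the non-nullable variant, assuming a pydantic-style `QuantizationScheme` (other fields omitted; the import path for `CompressionFormat` is assumed):

```python
from compressed_tensors.config import CompressionFormat
from pydantic import BaseModel


class QuantizationScheme(BaseModel):
    # instead of `format: Optional[CompressionFormat] = None`, default to
    # dense so downstream code never needs to branch on None
    format: CompressionFormat = CompressionFormat.dense
```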
Commented on / marked as resolved the thread on passing in `quantization_config`; one last comment on naming `quantization_compressor`.
🚢
SUMMARY:
- Requires neuralmagic/compressed-tensors#415
- Updates `infer_quantization_format` to be `infer_per_module_quantization_format` such that, instead of returning a global format, a per-module format is assigned to each module to be used during compression time. All unique compression formats are returned.
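Roughly, the described behavior can be pictured as follows (an illustrative sketch, not the actual llm-compressor implementation; `_infer_format_for` is a hypothetical helper):

```python
from typing import List


def infer_per_module_quantization_format(model) -> List["CompressionFormat"]:
    unique_formats: List["CompressionFormat"] = []
    for module in model.modules():
        scheme = getattr(module, "quantization_scheme", None)
        if scheme is None:
            continue
        fmt = _infer_format_for(scheme)  # hypothetical per-module inference
        scheme.format = fmt              # format is stored on the module's scheme
        if fmt not in unique_formats:
            unique_formats.append(fmt)
    return unique_formats
```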
Summary
- Updates `ModelCompressor.quantization_compressor` to now be a dictionary, such that more than one quantization compressor can be supported
- Adds `mixed-precision` as a new `CompressionFormat` - if more than one format is found within the model, `mixed-precision` is set as the model's global format in its `config.json`
- Adds `format` to the `QuantizationScheme` and leverages this per-module format field in order to fetch the appropriate compressor to compress the model
- `ModelCompressor.compress` and `ModelCompressor.decompress` are not updated - only `compress_model` and `decompress_model` currently support this functionality, as compress/decompress essentially only support global formats

Testing:
Next Steps:
Example Updates
New config: